373 research outputs found

    Attention: A Big Surprise for Cross-Domain Person Re-Identification

    Full text link
    In this paper, we focus on model generalization and adaptation for cross-domain person re-identification (Re-ID). Unlike existing cross-domain Re-ID methods, leveraging the auxiliary information of those unlabeled target-domain data, we aim at enhancing the model generalization and adaptation by discriminative feature learning, and directly exploiting a pre-trained model to new domains (datasets) without any utilization of the information from target domains. To address the discriminative feature learning problem, we surprisingly find that simply introducing the attention mechanism to adaptively extract the person features for every domain is of great effectiveness. We adopt two popular type of attention mechanisms, long-range dependency based attention and direct generation based attention. Both of them can perform the attention via spatial or channel dimensions alone, even the combination of spatial and channel dimensions. The outline of different attentions are well illustrated. Moreover, we also incorporate the attention results into the final output of model through skip-connection to improve the features with both high and middle level semantic visual information. In the manner of directly exploiting a pre-trained model to new domains, the attention incorporation method truly could enhance the model generalization and adaptation to perform the cross-domain person Re-ID. We conduct extensive experiments between three large datasets, Market-1501, DukeMTMC-reID and MSMT17. Surprisingly, introducing only attention can achieve state-of-the-art performance, even much better than those cross-domain Re-ID methods utilizing auxiliary information from the target domain

    Maximum mutual information regularized classification

    Full text link
    In this paper, a novel pattern classification approach is proposed by regularizing the classifier learning to maximize mutual information between the classification response and the true class label. We argue that, with the learned classifier, the uncertainty of the true class label of a data sample should be reduced by knowing its classification response as much as possible. The reduced uncertainty is measured by the mutual information between the classification response and the true class label. To this end, when learning a linear classifier, we propose to maximize the mutual information between classification responses and true class labels of training samples, besides minimizing the classification error and reduc- ing the classifier complexity. An objective function is constructed by modeling mutual information with entropy estimation, and it is optimized by a gradi- ent descend method in an iterative algorithm. Experiments on two real world pattern classification problems show the significant improvements achieved by maximum mutual information regularization

    Dual Purpose Hashing

    Full text link
    Recent years have seen more and more demand for a unified framework to address multiple realistic image retrieval tasks concerning both category and attributes. Considering the scale of modern datasets, hashing is favorable for its low complexity. However, most existing hashing methods are designed to preserve one single kind of similarity, thus improper for dealing with the different tasks simultaneously. To overcome this limitation, we propose a new hashing method, named Dual Purpose Hashing (DPH), which jointly preserves the category and attribute similarities by exploiting the Convolutional Neural Network (CNN) models to hierarchically capture the correlations between category and attributes. Since images with both category and attribute labels are scarce, our method is designed to take the abundant partially labelled images on the Internet as training inputs. With such a framework, the binary codes of new-coming images can be readily obtained by quantizing the network outputs of a binary-like layer, and the attributes can be recovered from the codes easily. Experiments on two large-scale datasets show that our dual purpose hash codes can achieve comparable or even better performance than those state-of-the-art methods specifically designed for each individual retrieval task, while being more compact than the compared methods.Comment: With supplementary materials added to the en

    Fully Learnable Group Convolution for Acceleration of Deep Neural Networks

    Full text link
    Benefitted from its great success on many tasks, deep learning is increasingly used on low-computational-cost devices, e.g. smartphone, embedded devices, etc. To reduce the high computational and memory cost, in this work, we propose a fully learnable group convolution module (FLGC for short) which is quite efficient and can be embedded into any deep neural networks for acceleration. Specifically, our proposed method automatically learns the group structure in the training stage in a fully end-to-end manner, leading to a better structure than the existing pre-defined, two-steps, or iterative strategies. Moreover, our method can be further combined with depthwise separable convolution, resulting in 5 times acceleration than the vanilla Resnet50 on single CPU. An additional advantage is that in our FLGC the number of groups can be set as any value, but not necessarily 2^k as in most existing methods, meaning better tradeoff between accuracy and speed. As evaluated in our experiments, our method achieves better performance than existing learnable group convolution and standard group convolution when using the same number of groups.Comment: Accepted by CVPR 201

    Learning Expressionlets via Universal Manifold Model for Dynamic Facial Expression Recognition

    Full text link
    Facial expression is temporally dynamic event which can be decomposed into a set of muscle motions occurring in different facial regions over various time intervals. For dynamic expression recognition, two key issues, temporal alignment and semantics-aware dynamic representation, must be taken into account. In this paper, we attempt to solve both problems via manifold modeling of videos based on a novel mid-level representation, i.e. \textbf{expressionlet}. Specifically, our method contains three key stages: 1) each expression video clip is characterized as a spatial-temporal manifold (STM) formed by dense low-level features; 2) a Universal Manifold Model (UMM) is learned over all low-level features and represented as a set of local modes to statistically unify all the STMs. 3) the local modes on each STM can be instantiated by fitting to UMM, and the corresponding expressionlet is constructed by modeling the variations in each local mode. With above strategy, expression videos are naturally aligned both spatially and temporally. To enhance the discriminative power, the expressionlet-based STM representation is further processed with discriminant embedding. Our method is evaluated on four public expression databases, CK+, MMI, Oulu-CASIA, and FERA. In all cases, our method outperforms the known state-of-the-art by a large margin.Comment: 12 page

    Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships

    Full text link
    Context is important for accurate visual recognition. In this work we propose an object detection algorithm that not only considers object visual appearance, but also makes use of two kinds of context including scene contextual information and object relationships within a single image. Therefore, object detection is regarded as both a cognition problem and a reasoning problem when leveraging these structured information. Specifically, this paper formulates object detection as a problem of graph structure inference, where given an image the objects are treated as nodes in a graph and relationships between the objects are modeled as edges in such graph. To this end, we present a so-called Structure Inference Network (SIN), a detector that incorporates into a typical detection framework (e.g. Faster R-CNN) with a graphical model which aims to infer object state. Comprehensive experiments on PASCAL VOC and MS COCO datasets indicate that scene context and object relationships truly improve the performance of object detection with more desirable and reasonable outputs.Comment: published in CVPR 201

    Temporal Action Detection by Joint Identification-Verification

    Full text link
    Temporal action detection aims at not only recognizing action category but also detecting start time and end time for each action instance in an untrimmed video. The key challenge of this task is to accurately classify the action and determine the temporal boundaries of each action instance. In temporal action detection benchmark: THUMOS 2014, large variations exist in the same action category while many similarities exist in different action categories, which always limit the performance of temporal action detection. To address this problem, we propose to use joint Identification-Verification network to reduce the intra-action variations and enlarge inter-action differences. The joint Identification-Verification network is a siamese network based on 3D ConvNets, which can simultaneously predict the action categories and the similarity scores for the input pairs of video proposal segments. Extensive experimental results on the challenging THUMOS 2014 dataset demonstrate the effectiveness of our proposed method compared to the existing state-of-art methods for temporal action detection in untrimmed videos

    Design and Implementation of a General Decision-making Model in RoboCup Simulation

    Full text link
    The study of the collaboration, coordination and negotiation among different agents in a multi-agent system (MAS) has always been the most challenging yet popular in the research of distributed artificial intelligence. In this paper, we will suggest for RoboCup simulation, a typical MAS, a general decision-making model, rather than define a different algorithm for each tactic (e.g. ball handling, pass, shoot and interception, etc.) in soccer games as most RoboCup simulation teams did. The general decision-making model is based on two critical factors in soccer games: the vertical distance to the goal line and the visual angle for the goalpost. We have used these two parameters to formalize the defensive and offensive decisions in RoboCup simulation and the results mentioned above had been applied in NOVAURO, original name is UJDB, a RoboCup simulation team of Jiangsu University, whose decision-making model, compared with that of Tsinghua University, the world champion team in 2001, is a universal model and easier to be implemented

    Learning Mid-level Words on Riemannian Manifold for Action Recognition

    Full text link
    Human action recognition remains a challenging task due to the various sources of video data and large intra-class variations. It thus becomes one of the key issues in recent research to explore effective and robust representation to handle such challenges. In this paper, we propose a novel representation approach by constructing mid-level words in videos and encoding them on Riemannian manifold. Specifically, we first conduct a global alignment on the densely extracted low-level features to build a bank of corresponding feature groups, each of which can be statistically modeled as a mid-level word lying on some specific Riemannian manifold. Based on these mid-level words, we construct intrinsic Riemannian codebooks by employing K-Karcher-means clustering and Riemannian Gaussian Mixture Model, and consequently extend the Riemannian manifold version of three well studied encoding methods in Euclidean space, i.e. Bag of Visual Words (BoVW), Vector of Locally Aggregated Descriptors (VLAD), and Fisher Vector (FV), to obtain the final action video representations. Our method is evaluated in two tasks on four popular realistic datasets: action recognition on YouTube, UCF50, HMDB51 databases, and action similarity labeling on ASLAN database. In all cases, the reported results achieve very competitive performance with those most recent state-of-the-art works.Comment: 10 page

    Learning Class Prototypes via Structure Alignment for Zero-Shot Recognition

    Full text link
    Zero-shot learning (ZSL) aims to recognize objects of novel classes without any training samples of specific classes, which is achieved by exploiting the semantic information and auxiliary datasets. Recently most ZSL approaches focus on learning visual-semantic embeddings to transfer knowledge from the auxiliary datasets to the novel classes. However, few works study whether the semantic information is discriminative or not for the recognition task. To tackle such problem, we propose a coupled dictionary learning approach to align the visual-semantic structures using the class prototypes, where the discriminative information lying in the visual space is utilized to improve the less discriminative semantic space. Then, zero-shot recognition can be performed in different spaces by the simple nearest neighbor approach using the learned class prototypes. Extensive experiments on four benchmark datasets show the effectiveness of the proposed approach.Comment: To appear in ECCV 201
    • …
    corecore